NSF PAR Search | NSF Public Access Repository

Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems

Lou, Chang; Parikesit, Dimas Shidqi; Huang, Yujin; Yang, Zhewen; Diwangkara, Senapati; Jing, Yuzhuo; Kistijantoro, Achmad Imam; Yuan, Ding; Nath, Suman; Huang, Peng (July 2025, 19th USENIX Symposium on Operating Systems Design and Implementation)

Production distributed systems provide rich features, but various defects can cause a system to silently violate its semantics without explicit errors. Such failures have serious consequences, yet they are extremely challenging to detect, because writing good checkers requires deep domain knowledge and substantial manual effort. In this paper, we explore a novel approach that directly derives semantic checkers from system test code. We first present a large-scale study of existing system test cases. Guided by the study findings, we develop T2C, a framework that uses static and dynamic analysis to transform and generalize a test into a runtime checker. We apply T2C to four large, popular distributed systems and successfully derive tens to hundreds of checkers. These checkers detect 15 of the 20 real-world silent failures we reproduce and incur small runtime overhead.
Free, publicly-accessible full text available July 7, 2026
Operating System Support for Safe and Efficient Auxiliary Execution

Jing, Yuzhuo; Huang, Peng (July 2022, 16th USENIX Symposium on Operating Systems Design and Implementation)

Modern applications run various auxiliary tasks. These tasks gain high observability and control by executing in the application address space, but doing so causes safety and performance issues. Running them in a separate process offers strong isolation but poor observability and control. In this paper, we propose special OS support for auxiliary tasks to address this challenge with an abstraction called orbit. An orbit task offers strong isolation. At the same time, it conveniently observes the main program through an automatic state synchronization feature. We implement the abstraction in the Linux kernel. We use orbit to port 7 existing auxiliary tasks and add one new task in 6 large applications. The evaluation shows that the orbit versions of the tasks have strong isolation with performance comparable to that of the original unsafe tasks.
Full Text Available
Demystifying and Checking Silent Semantic Violations in Large Distributed Systems

Lou, Chang; Jing, Yuzhuo; Huang, Peng (July 2022, 16th USENIX Symposium on Operating Systems Design and Implementation)

Distributed systems today offer rich features with numerous semantics that users depend on. Bugs can cause a system to silently violate its semantics without apparent anomalies. Such silent violations cause prolonged damage and are difficult to address. Yet, this problem is under-investigated. In this paper, we first study 109 real-world silent semantic failures from nine widely used distributed systems to shed some light on this difficult problem. Our study reveals more than a dozen informative findings. For example, it shows that, surprisingly, the majority of the studied failures violated semantics that had existed since the system’s first stable release. Guided by insights from our study, we design Oathkeeper, a tool that automatically infers semantic rules from past failures and enforces the rules at runtime to detect new failures. Evaluation shows that the inferred rules detect newer violations, and that Oathkeeper incurs only 1.27% overhead.
Full Text Available
BORA: A Bag Optimizer for Robotic Analysis

https://doi.org/10.1109/SC41405.2020.00016

Zhang, Jian; Xie, Tao; Jing, Yuzhuo; Song, Yanjie; Hu, Guanzhou; Chen, Si; Yin, Shu (November 2020, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 2020))
Full Text Available